1.Summary

This tutorial is part of the Stevens Data Cloud Project (SDC).

We introduce the following databases and middleware, with installation instructions and configuration:

Type | Name | Features | Platform
Relational Database | PostgreSQL | Relational Database | PC/AWS/Cloud Platform
Non-Relational Database | MongoDB | Document Database | PC/AWS/Cloud Platform
Non-Relational Database | Cassandra | NoSQL | PC/AWS/Cloud Platform
Public MiddleWare | Kafka | Message Queue | PC/AWS/Cloud Platform
Private MiddleWare | Kinesis | Message Queue | AWS only


2.Table of Contents

Here is the list of tutorial sections. You can click an item to jump directly to that section.

  1. Summary
  2. Table of Contents
  3. Source
  4. [Database] PostgreSQL
  5. [Database] MongoDB
  6. [Database] Cassandra
  7. [MiddleWare] Kafka
  8. [MiddleWare] Kinesis
  9. Build a web mining pipeline

Click this Link back to Top


3.Source

This section lists URLs that may be useful when you start using this tutorial.




4.[Database] PostgreSQL

PostgreSQL is an open-source relational database. It can be installed on most platforms. We cover installation on two of them: local Windows and Linux.

4.1 Install PostgreSQL on Windows

The latest PostgreSQL release is 12, but we recommend the older 10.14 release, since an older release is sometimes more stable.

Download link: https://www.enterprisedb.com/downloads/postgres-postgresql-downloads

%load_ext sql

%sql postgresql://postgres:00wasabi00@127.0.0.1/salesdb

The first command loads the sql magic extension. The second command connects to PostgreSQL; the connection string has the form postgresql://user:password@ipaddress/databasename
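To make the connection string format concrete, here is a minimal Python sketch that splits the URL above into its parts using only the standard library. The helper name describe_connection_url is hypothetical and is not part of the original notebook.

```python
from urllib.parse import urlsplit


def describe_connection_url(url):
    """Split a database URL of the form scheme://user:password@host/dbname."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        # The path is "/salesdb"; the database name is the part after the slash.
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    info = describe_connection_url(
        "postgresql://postgres:00wasabi00@127.0.0.1/salesdb"
    )
    for key, value in info.items():
        print(f"{key}: {value}")
```

Reading the URL this way is a quick check that the user, password, host, and database name each sit in the expected position before passing the string to %sql.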




4.2 Install PostgreSQL on Linux

We follow the official guide for Ubuntu: https://www.postgresql.org/download/linux/ubuntu/

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -

sudo apt-get update

sudo apt-get -y install postgresql-10




5.[Database] MongoDB

This tutorial introduces MongoDB, including installation, connection, and basic operations.

MongoDB is a NoSQL database of the kind known as a document database. Instead of the table-column-attribute layout of SQL databases, MongoDB uses collections of key-value documents. One MongoDB server can hold several databases, such as admin, local (default), 03_test_db, and 04_geo_db. Each database can contain many collections (the counterpart of tables); each collection is isolated from the others and holds a complete set of JSON documents.
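The hierarchy described above can be pictured with plain Python dictionaries. This is only an illustrative in-memory model, not a MongoDB API; the collection name articles and the document fields are assumptions chosen for the example.

```python
# Illustrative model only: server -> databases -> collections -> documents.
server = {
    "admin": {},
    "local": {},
    "03_test_db": {
        "articles": [
            # Documents in one collection do not need to share the same keys.
            {"_id": 1, "title": "First post", "tags": ["intro"]},
            {"_id": 2, "title": "Second post", "author": {"name": "Derrick"}},
        ],
    },
}


def count_documents(model, database, collection):
    """Count the documents stored in one collection of the model."""
    return len(model[database][collection])


if __name__ == "__main__":
    print(sorted(server))
    print(count_documents(server, "03_test_db", "articles"))
```

Note that the two documents have different fields, which a SQL table with fixed columns would not allow.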

We can install MongoDB on: Windows (local), Mongo Atlas, AWS EC2, or AWS DocumentDB.

We can connect to MongoDB in several ways: MongoDB Compass, MongoShell, or PyMongo.

Here is the catalogue of this tutorial; the topics are ordered from easy to hard.

Level | Server | Type | Connection Method
Easy | Windows | Local | MongoDB Compass + MongoShell + PyMongo
Recommend | Mongo Atlas | Cloud Cluster | MongoDB Compass + MongoShell + PyMongo
Normal | AWS EC2 | Cloud Non-Cluster | MongoDB Compass + MongoShell + PyMongo
Hard | AWS DocumentDB | Cloud Cluster | MongoShell + PyMongo

5.1 Install MongoDB on Windows (local)[EASY]





5.2 Install MongoDB on Mongo Atlas [RECOMMEND]

  1. MongoDB provides a free cloud version called Atlas, but its storage is very small, only 500MB.
  2. Create a cluster first.
  3. Then create a user for accessing the database.
  4. Add a new entry to the IP whitelist. For instance, 0.0.0.0/0 accepts connections from every IP address in the world.
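To illustrate what a whitelist entry such as 0.0.0.0/0 means, the following sketch uses Python's standard ipaddress module to test whether a client address falls inside a list of CIDR ranges. The function name is_allowed and the sample ranges are assumptions for the example; Atlas performs this check on its own side.

```python
import ipaddress


def is_allowed(client_ip, whitelist):
    """Return True if client_ip falls inside any CIDR entry of the whitelist."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(entry) for entry in whitelist)


if __name__ == "__main__":
    # 0.0.0.0/0 covers the whole IPv4 space, so every client matches.
    print(is_allowed("52.72.13.133", ["0.0.0.0/0"]))        # True
    # A narrower range only admits addresses inside it.
    print(is_allowed("52.72.13.133", ["203.0.113.0/24"]))   # False
    # A /32 entry admits exactly one address.
    print(is_allowed("203.0.113.7", ["203.0.113.7/32"]))    # True
```

Opening the whitelist to 0.0.0.0/0 is convenient for a tutorial, but a narrower range limits who can reach the cluster.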




5.3 Install MongoDB on AWS EC2 [NORMAL]

Typically, we could install MongoDB directly on a new EC2 instance. However, since we will set up a complete MongoDB deployment in the next section, we recommend using AWS CloudFormation here to build a non-cluster MongoDB.




5.4 Install MongoDB on AWS DocumentDB

Amazon DocumentDB is Amazon's own cluster-based document database, built for compatibility with MongoDB.

5.4.1 Prerequisites


5.4.2 Create a DocumentDB Cluster

Remember to stop the cluster when you are done; a running cluster costs money.


5.4.3 Launch an AWS EC2 Instance (login client)




5.4.4 Access Amazon DocumentDB by mongo shell



5.4.5 Use Windows like Linux

The previous sections only demonstrated how to connect from Windows. This section shows how to connect to AWS EC2 the way you would on Mac or Linux. Microsoft has added many Linux-style tools to Windows 10, so cmd can be used much like a Linux terminal. All screenshots here come from a Windows laptop, but the logic is identical on Mac and Linux.

ssh -i EC2_0610.pem ubuntu@ec2-52-72-13-133.compute-1.amazonaws.com

With this command we can access EC2 from Windows without PuTTY.

[Screenshot: 04_image/47.png]




5.5 Connect Mongo Atlas with MongoDB Compass + MongoShell + PyMongo

The previous sections showed how to install MongoDB on different platforms. We now introduce how to connect to them with different tools.

5.5.1 Atlas + MongoDB Compass

MongoDB Compass can be installed together with MongoDB. It is a GUI client.




5.5.2 Atlas + MongoDB MongoShell

Connect Mongo Atlas with MongoDB MongoShell

We start by importing a JSON file into our cluster. This is the basic command:

mongoimport --host atlas-4ypdkm-shard-0/cluster4-shard-00-00.62ji5.mongodb.net:27017,cluster4-shard-00-01.62ji5.mongodb.net:27017,cluster4-shard-00-02.62ji5.mongodb.net:27017 \
  --ssl \
  --username mongoadmin \
  --password mongoadmin \
  --authenticationDatabase admin \
  --db covid \
  --collection jan-one \
  --type json \
  --file D:\Downloads\covid\16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json

  1. --host: this is a sharded cluster, so atlas-4ypdkm-shard-0 is the name of the shard and the rest are the addresses of its three nodes
  2. --ssl: use a secure (SSL/TLS) connection
  3. --username, --password: the database username and password
  4. --db: the name of the database you want to create
  5. --collection: the name of the collection you want to create in the database, preferably not a number
  6. --type: JSON, CSV, or another supported type
  7. --file: the file path of the source data

If you run into an error, try deleting the slash "/".
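The options listed above can also be assembled as a Python argument list, which avoids shell line-continuation and quoting problems. This is a sketch under assumptions: the helper name build_mongoimport_args is hypothetical, and it only uses the flags that already appear in the command above.

```python
import shlex


def build_mongoimport_args(host, username, password, db, collection,
                           file_path, file_type="json", auth_db="admin"):
    """Assemble the mongoimport command as a list, one element per argument."""
    return [
        "mongoimport",
        "--host", host,
        "--ssl",
        "--username", username,
        "--password", password,
        "--authenticationDatabase", auth_db,
        "--db", db,
        "--collection", collection,
        "--type", file_type,
        "--file", file_path,
    ]


if __name__ == "__main__":
    args = build_mongoimport_args(
        host="atlas-4ypdkm-shard-0/cluster4-shard-00-00.62ji5.mongodb.net:27017",
        username="mongoadmin",
        password="mongoadmin",
        db="covid",
        collection="jan-one",
        file_path="data.json",  # placeholder path for the example
    )
    # Print the command for inspection (POSIX-style quoting).
    print(" ".join(shlex.quote(arg) for arg in args))
```

To run the import instead of printing it, the same list can be passed to subprocess.run(args, check=True) on a machine where mongoimport is installed.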







5.5.3 Atlas + PyMongo

With PyMongo, we can use Python to access the remote Mongo Atlas database directly.
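A minimal sketch of such a connection follows. It assumes the pymongo and dnspython packages are installed (dnspython is needed for mongodb+srv URIs) and reuses the Atlas host name and credentials that appear in the notes later in this tutorial. The helper names build_atlas_uri and list_atlas_databases are hypothetical.

```python
from urllib.parse import quote_plus


def build_atlas_uri(username, password, host, database):
    """Build a mongodb+srv URI, percent-escaping the credentials."""
    return "mongodb+srv://{}:{}@{}/{}".format(
        quote_plus(username), quote_plus(password), host, database
    )


def list_atlas_databases(uri):
    """Connect to the cluster and return its database names."""
    # Imported here so that build_atlas_uri works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return client.list_database_names()
    finally:
        client.close()
```

For example, list_atlas_databases(build_atlas_uri("admin", "admin", "cluster-test-03-62ji5.mongodb.net", "sample_airbnb")) would return the database names, provided the cluster is running and your IP address is on the whitelist. Escaping the credentials matters when the password contains characters such as "@" or "/".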



5.6 Connect AWS EC2 with MongoDB Compass + MongoShell + PyMongo

  1. Run mongod (this works directly if you have added the MongoDB directory to the PATH environment variable).
  2. MongoDB will run in this command line or in a new one. To access this local database, run mongo.
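Before running mongo, it can help to confirm that mongod is actually listening. The sketch below checks whether a TCP port accepts connections, assuming mongod uses its default port 27017 on the local machine; the function name is_port_open is hypothetical.

```python
import socket


def is_port_open(host="127.0.0.1", port=27017, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    if is_port_open():
        print("Something is listening on 127.0.0.1:27017; mongo should connect.")
    else:
        print("Nothing is listening on 127.0.0.1:27017; start mongod first.")
```

An open port only shows that some process is listening there; running mongo is still the real test that it is MongoDB.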




5.7 Connect AWS DocumentDB with MongoShell and PyMongo

This part has two sections: the first sets up Jupyter on EC2, and the second covers basic PyMongo usage.

5.7.1 Jupyter Notebook on EC2

As the previous chapter showed, AWS DocumentDB cannot be accessed directly from outside its VPC, except through an SSH tunnel. So we set up a Jupyter notebook on EC2 and work with DocumentDB through the pymongo package.

Get the Anaconda3 package: wget https://repo.anaconda.com/archive/Anaconda3-2020.02-Linux-x86_64.sh

conda activate

ipython

'sha1:2b4e5095a913:2ff43c1fc73822d6c98dfa86cdef3b26fcbe2f3c'

openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

'sha1:6741a01d5f68:c4f99e8a386c093b8f3123099f8b9a860dfa1fa5'

vi jupyter_notebook_config.py

screen

sudo chown $USER:$USER /home/ubuntu/certs/mycert.pem

jupyter notebook

screen

jupyter notebook

https://ec2-52-72-13-133.compute-1.amazonaws.com:8888 password=admin
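The steps above edit jupyter_notebook_config.py but do not show its contents. The sketch below renders one plausible version as text. It is an assumption, not the author's actual file: the option names are those of the classic Jupyter Notebook server (c.NotebookApp.*), and newer Jupyter releases may expect c.ServerApp.* instead. The certificate path, password hash, and port are taken from the commands above; render_notebook_config is a hypothetical helper.

```python
def render_notebook_config(certfile, password_hash, port=8888):
    """Return the text of a minimal jupyter_notebook_config.py (assumed layout)."""
    lines = [
        "c = get_config()",
        # Serve over HTTPS using the self-signed certificate created with openssl.
        "c.NotebookApp.certfile = {!r}".format(certfile),
        # Listen on all interfaces so the notebook is reachable from outside EC2.
        "c.NotebookApp.ip = '0.0.0.0'",
        # There is no desktop browser on the server.
        "c.NotebookApp.open_browser = False",
        # Hashed password, as printed in the ipython session.
        "c.NotebookApp.password = {!r}".format(password_hash),
        "c.NotebookApp.port = {}".format(port),
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_notebook_config(
        "/home/ubuntu/certs/mycert.pem",
        "sha1:2b4e5095a913:2ff43c1fc73822d6c98dfa86cdef3b26fcbe2f3c",
    ))
```

The EC2 security group must also allow inbound traffic on the chosen port, otherwise the https URL above will not load.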




5.7.2 Connect AWS DocumentDB with pymongo

  1. AWS documentation: https://docs.aws.amazon.com/documentdb/latest/developerguide/connect_programmatically.html
  2. Initialize the Jupyter runtime environment:
    1. !conda install -c anaconda pymongo --yes
    2. !conda install --yes -c anaconda dnspython
    3. !conda install --yes -c jmcmurray json
    4. import pymongo
    5. import dns
    6. import json
    7. from pymongo import MongoClient
    8. client = pymongo.MongoClient('mongodb://mongoadmin:mongoadmin@docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com:27017/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred')
    9. client.list_database_names()
  3. mongoadmin:mongoadmin is the cluster username and password
  4. docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com is the cluster endpoint
  5. ssl_ca_certs=rds-combined-ca-bundle.pem points to the CA certificate bundle used for TLS
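The same three pieces (credentials, endpoint, CA bundle) can be passed to MongoClient as keyword arguments instead of one long URI. The sketch below is an assumption aimed at newer PyMongo releases, where the tls and tlsCAFile options take the place of ssl and ssl_ca_certs; the helper names are hypothetical.

```python
def documentdb_client_options(endpoint, username, password,
                              ca_file="rds-combined-ca-bundle.pem"):
    """Collect MongoClient keyword arguments for an Amazon DocumentDB cluster."""
    return {
        "host": endpoint,
        "port": 27017,
        "username": username,
        "password": password,
        # TLS is on, verified against the downloaded CA bundle.
        "tls": True,
        "tlsCAFile": ca_file,
        "replicaSet": "rs0",
        "readPreference": "secondaryPreferred",
    }


def connect_documentdb(options):
    """Open a client; must run inside the VPC, for example on the EC2 notebook."""
    # Imported here so the options helper works without pymongo installed.
    from pymongo import MongoClient

    return MongoClient(**options)
```

For example, connect_documentdb(documentdb_client_options("docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com", "mongoadmin", "mongoadmin")).list_database_names() mirrors steps 8 and 9 above.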


https://docs.aws.amazon.com/documentdb/latest/developerguide/connect-from-outside-a-vpc.html

In this part we introduce the basic commands of PyMongo.

client.list_database_names()

client_cluster = MongoClient("mongodb+srv://admin:admin@cluster-test-03-62ji5.mongodb.net/test")
client_cluster.list_database_names()

limited_result = collection1.find().limit(1)
for doc in limited_result:
    print(doc)

MongoDB uses JSON-like documents to store data. In PyMongo we use dictionaries to represent documents.

To insert a document into a collection we use the insert_one() method.

When the document is inserted, a special key, _id, is generated; it is unique to this document.

The articles collection is created after the first document is inserted. We can confirm this using the list_collection_names() method.

We can insert multiple documents into a collection using the insert_many() method as shown below:

find_one() returns a single document matching the query, or None if no match exists. This method returns the first match that it comes across. When we call the method below, we get the first article we inserted into our collection.

We can use the sort() method to sort the results in ascending or descending order. The default order is ascending. We use 1 to signify ascending and -1 to signify descending.

We update a document using the update_one() method. The first parameter taken by this function is a query object defining the document to be updated. If the method finds more than one document, it will only update the first one. Let's update the name of the author in the article written by Derrick.

MongoDB enables us to limit the result of our query using the limit method. In our query below we'll limit the result to one record.
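The methods described in this part can be exercised together in one short function. This is a sketch, not the notebook's original cells: the document fields, the second author name, and the function name run_article_demo are assumptions made for illustration. It expects a PyMongo collection object, for example client["03_test_db"]["articles"].

```python
def run_article_demo(collection):
    """Exercise insert, find, sort, update, and limit on a PyMongo collection."""
    # insert_one(): store one dictionary; MongoDB adds a unique _id.
    first = {"author": "Derrick", "title": "Getting started", "views": 10}
    first_id = collection.insert_one(first).inserted_id

    # insert_many(): store several documents in one call.
    collection.insert_many([
        {"author": "Florian", "title": "Sorting", "views": 30},
        {"author": "Derrick", "title": "Limits", "views": 20},
    ])

    # find_one(): the first matching document, or None if nothing matches.
    found = collection.find_one({"author": "Derrick"})

    # sort(): 1 means ascending, -1 means descending.
    titles_by_views = [
        doc["title"] for doc in collection.find().sort("views", -1)
    ]

    # update_one(): only the first document matching the query is changed.
    collection.update_one(
        {"author": "Derrick"}, {"$set": {"author": "Derrick M."}}
    )

    # limit(): cap the number of documents returned.
    limited = list(collection.find().limit(1))

    return first_id, found, titles_by_views, limited
```

Running it against an empty test collection leaves three documents behind, one of which has the updated author name.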




user:mongoadmin

password: mongoadmin

  1. Check the Security Group (allow SSH, and maybe ICMP for ping tests)
  2. Check the key pair (private .ppk for Windows, .pem for Linux)

Creation time: 06102020

EC2 login: ubuntu@ec2-52-72-13-133.compute-1.amazonaws.com

Putty: putty -ssh -i EC2_0610.ppk ubuntu@ec2-52-72-13-133.compute-1.amazonaws.com

Copy from local to EC2: scp -i EC2_0610.pem D:\Downloads\covid\16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json ubuntu@ec2-52-72-13-133.compute-1.amazonaws.com:/home/ubuntu/Notebooks

Jupyter notebook:https://ec2-52-72-13-133.compute-1.amazonaws.com:8888/

Jupyter Password:admin

MongoDB Username:mongoadmin MongoDB Password:mongoadmin

Import JSON into DocumentDB:

mongoimport --ssl \
  --host docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com:27017 \
  --sslCAFile rds-combined-ca-bundle.pem \
  --username=mongoadmin \
  --password=mongoadmin \
  --collection=col-data \
  --db=covid \
  --file=16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json \
  --numInsertionWorkers 4

mongoimport --uri "mongodb://mongoadmin:mongoadmin@docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com:27017/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=myAtlasRS&authSource=admin" -d covid -c coldata --file 16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json --type json --ssl


mongoimport --host docdb-2020-06-11-03-01-07/docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com:27017 --ssl --sslCAFile rds-combined-ca-bundle.pem --username mongoadmin --password mongoadmin --db covid --collection 20200101 --type json --file 16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json


mongo --ssl \
  --host docdb-2020-06-11-03-01-07.cluster-crp0v4bay2br.us-east-1.docdb.amazonaws.com:27017 \
  --sslCAFile rds-combined-ca-bundle.pem \
  --username mongoadmin \
  --password mongoadmin

ATLAS

mongo "mongodb+srv://cluster-test-03-62ji5.mongodb.net/sample_airbnb" --username admin --password admin

mongoimport --host Cluster-test-03-shard-0/cluster-test-03-shard-00-00-62ji5.mongodb.net:27017,cluster-test-03-shard-00-01-62ji5.mongodb.net:27017,cluster-test-03-shard-00-02-62ji5.mongodb.net:27017 --ssl --username admin --password admin --authenticationDatabase admin --db covid --collection 20200101 --type json --file 16119_webhose_2020_01_db21c91a1ab47385bb13773ed8238c31_0000001.json





6.[Database] Cassandra

This tutorial introduces how to install Apache Kafka, Cassandra, and Kinesis. Here is the reference list:

  1. Apache Cassandra official documentation: https://cassandra.apache.org
  2. Apache Kafka official documentation:

Cassandra is an open-source NoSQL database. To avoid compatibility problems, we recommend Ubuntu 20.04 as the operating system. We first build a one-cluster Cassandra database and then reconfigure it into two clusters. We use AWS for the demonstration.

For more specific details, please check the official documentation: https://cassandra.apache.org


6.1 Initial Operating System

IMPORTANT! With the default settings, the OS must have at least 8GB of memory.




6.2 Java Environment

Make sure a Java environment is installed: sudo apt-get install openjdk-8-jre-headless

Check your Java install path: pwd

  • Copy and record the path: /usr/lib/jvm/java-8-openjdk-amd64
  • Configure the environment variable so that Cassandra can find the correct Java version: sudo nano ~/.bashrc
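The source does not say which lines to add to ~/.bashrc. The sketch below shows one common choice, offered as an assumption: set JAVA_HOME to the recorded path and put its bin directory first on PATH. The helper name bashrc_java_lines is hypothetical.

```python
def bashrc_java_lines(java_home="/usr/lib/jvm/java-8-openjdk-amd64"):
    """Return shell lines that point JAVA_HOME and PATH at one Java install."""
    return [
        # Assumed variable name: JAVA_HOME is a common convention for Java tools.
        "export JAVA_HOME={}".format(java_home),
        # Put this Java first on PATH so it wins over other installed versions.
        'export PATH="$JAVA_HOME/bin:$PATH"',
    ]


if __name__ == "__main__":
    # Print the lines so they can be pasted at the end of ~/.bashrc.
    print("\n".join(bashrc_java_lines()))
```

After saving the file, running source ~/.bashrc (or opening a new terminal) makes the variables take effect, and java -version shows which version is picked up.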




6.3 Debian Package




6.4 Troubleshooting

When Cassandra does not start correctly, I recommend using the logs for troubleshooting. The default log path is /var/log/cassandra/, and you can run grep 'WARN\|ERROR' /var/log/cassandra/system.log | tail to show the most recent warnings and errors.
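The grep and tail pipeline above can be mirrored in Python, which is handy when inspecting the log from a notebook. This is a minimal sketch; the function names are hypothetical, and the default of ten lines matches what tail prints by default.

```python
from collections import deque


def tail_warnings(lines, levels=("WARN", "ERROR"), count=10):
    """Return the last `count` lines that contain any of the given levels."""
    # A bounded deque keeps only the most recent matches, like `tail`.
    matches = deque(maxlen=count)
    for line in lines:
        if any(level in line for level in levels):
            matches.append(line.rstrip("\n"))
    return list(matches)


def tail_warnings_from_file(path="/var/log/cassandra/system.log", count=10):
    """Apply tail_warnings to a log file on disk."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return tail_warnings(handle, count=count)


if __name__ == "__main__":
    sample = ["INFO started\n", "WARN low memory\n", "ERROR bind failed\n"]
    for entry in tail_warnings(sample):
        print(entry)
```

On the Cassandra host, tail_warnings_from_file() reads the default log path given above; reading that file may require sudo or membership in the appropriate group.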

Cassandra's default configuration directory is /etc/cassandra, and most of the configuration is written in /etc/cassandra/cassandra.yaml